safety value function
AnySafe: Adapting Latent Safety Filters at Runtime via Safety Constraint Parameterization in the Latent Space
Agrawal, Sankalp, Seo, Junwon, Nakamura, Kensuke, Tian, Ran, Bajcsy, Andrea
Recent works have shown that foundational safe control methods, such as Hamilton-Jacobi (HJ) reachability analysis, can be applied in the latent space of world models. While this enables the synthesis of latent safety filters for hard-to-model vision-based tasks, they assume that the safety constraint is known a priori and remains fixed during deployment, limiting the safety filter's adaptability across scenarios. To address this, we propose constraint-parameterized latent safety filters that can adapt to user-specified safety constraints at runtime. Our key idea is to define safety constraints by conditioning on an encoding of an image that represents a constraint, using a latent-space similarity measure. The notion of similarity to failure is aligned in a principled way through conformal calibration, which controls how closely the system may approach the constraint representation. The parameterized safety filter is trained entirely within the world model's imagination, treating any image seen by the model as a potential test-time constraint, thereby enabling runtime adaptation to arbitrary safety constraints. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our method adapts at runtime by conditioning on the encoding of user-specified constraint images, without sacrificing performance. Video results can be found on https://any-safe.github.io
Imposing Exact Safety Specifications in Neural Reachable Tubes
Singh, Aditya, Feng, Zeyuan, Bansal, Somil
Hamilton-Jacobi (HJ) reachability analysis is a verification tool that provides safety and performance guarantees for autonomous systems. It is widely adopted because of its ability to handle nonlinear dynamical systems with bounded adversarial disturbances and constraints on states and inputs. However, it involves solving a PDE to compute a safety value function, whose computational and memory complexity scales exponentially with the state dimension, making its direct usage in large-scale systems intractable. Recently, a learning-based approach called DeepReach, has been proposed to approximate high-dimensional reachable tubes using neural networks. While DeepReach has been shown to be effective, the accuracy of the learned solution decreases with the increase in system complexity. One of the reasons for this degradation is the inexact imposition of safety constraints during the learning process, which corresponds to the PDE's boundary conditions. Specifically, DeepReach imposes boundary conditions as soft constraints in the loss function, which leaves room for error during the value function learning. Moreover, one needs to carefully adjust the relative contributions from the imposition of boundary conditions and the imposition of the PDE in the loss function. This, in turn, induces errors in the overall learned solution. In this work, we propose a variant of DeepReach that exactly imposes safety constraints during the learning process by restructuring the overall value function as a weighted sum of the boundary condition and neural network output. This eliminates the need for a boundary loss during training, thus bypassing the need for loss adjustment. We demonstrate the efficacy of the proposed approach in significantly improving the accuracy of learned solutions for challenging high-dimensional reachability tasks, such as rocket-landing and multivehicle collision-avoidance problems.
Safety-aware Policy Optimisation for Autonomous Racing
Chen, Bingqing, Francis, Jonathan, Herman, James, Oh, Jean, Nyberg, Eric, Herbert, Sylvia L.
To be viable for safety-critical applications, such as autonomous driving and assistive robotics, autonomous agents should adhere to safety constraints throughout the interactions with their environments. Instead of learning about safety by collecting samples, including unsafe ones, methods such as Hamilton-Jacobi (HJ) reachability compute safe sets with theoretical guarantees using models of the system dynamics. However, HJ reachability is not scalable to high-dimensional systems, and the guarantees hinge on the quality of the model. In this work, we inject HJ reachability theory into the constrained Markov decision process (CMDP) framework, as a control-theoretical approach for safety analysis via model-free updates on state-action pairs. Furthermore, we demonstrate that the HJ safety value can be learned directly on vision context, the highest-dimensional problem studied via the method to-date. We evaluate our method on several benchmark tasks, including Safety Gym and Learn-to-Race (L2R), a recently-released high-fidelity autonomous racing environment. Our approach has significantly fewer constraint violations in comparison to other constrained RL baselines, and achieve the new state-of-the-art results on the L2R benchmark task.
Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials
Aboutalebi, Hossein, Precup, Doina, Schuster, Tibor
The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials, it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.